I used a subset of the ‘prosperLoanData.csv’ to create a data frame with 15 columns. I chose variables that may give insight into loan criteria for granting a loan and determining interest rate.
The dimensions of the data frame are:
## [1] 113937 15
Variables:
## [1] "LoanStatus" "BorrowerRate"
## [3] "EmploymentStatusDuration" "EmploymentStatus"
## [5] "CreditScoreRangeLower" "CreditScoreRangeUpper"
## [7] "FirstRecordedCreditLine" "AmountDelinquent"
## [9] "DebtToIncomeRatio" "IncomeRange"
## [11] "IncomeVerifiable" "StatedMonthlyIncome"
## [13] "LoanOriginalAmount" "LoanOriginationDate"
## [15] "LP_NetPrincipalLoss"
Data frame structure:
## 'data.frame': 113937 obs. of 15 variables:
## $ LoanStatus : Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 4 3 4 4 4 4 4 4 4 ...
## $ BorrowerRate : num 0.158 0.092 0.275 0.0974 0.2085 ...
## $ EmploymentStatusDuration: int 2 44 NA 113 44 82 172 103 269 269 ...
## $ EmploymentStatus : Factor w/ 9 levels "","Employed",..: 9 2 4 2 2 2 2 2 2 2 ...
## $ CreditScoreRangeLower : int 640 680 480 800 680 740 680 700 820 820 ...
## $ CreditScoreRangeUpper : int 659 699 499 819 699 759 699 719 839 839 ...
## $ FirstRecordedCreditLine : Factor w/ 11586 levels "","1/1/00 0:00",..: 1052 5096 9250 4516 4254 4784 11315 4433 5809 5809 ...
## $ AmountDelinquent : int 472 0 NA 10056 0 0 0 0 0 0 ...
## $ DebtToIncomeRatio : num 0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
## $ IncomeRange : Factor w/ 8 levels "$0 ","$1-24,999",..: 4 5 7 4 3 3 4 4 4 4 ...
## $ IncomeVerifiable : logi TRUE TRUE TRUE TRUE TRUE TRUE ...
## $ StatedMonthlyIncome : num 3083 6125 2083 2875 9583 ...
## $ LoanOriginalAmount : int 9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
## $ LoanOriginationDate : Factor w/ 1873 levels "1/10/06 0:00",..: 1729 880 35 314 1783 541 974 1099 476 476 ...
## $ LP_NetPrincipalLoss : num 0 0 0 0 0 0 0 0 0 0 ...
Levels of LoanStatus variable:
## [1] "Cancelled" "Chargedoff"
## [3] "Completed" "Current"
## [5] "Defaulted" "FinalPaymentInProgress"
## [7] "Past Due (>120 days)" "Past Due (1-15 days)"
## [9] "Past Due (16-30 days)" "Past Due (31-60 days)"
## [11] "Past Due (61-90 days)" "Past Due (91-120 days)"
Levels of IncomeRange variable:
## [1] "$0 " "$1-24,999" "$100,000+" "$25,000-49,999"
## [5] "$50,000-74,999" "$75,000-99,999" "Not displayed" "Not employed"
Summary of the data frame:
## LoanStatus BorrowerRate EmploymentStatusDuration
## Current :56576 Min. :0.0000 Min. : 0.00
## Completed :38074 1st Qu.:0.1340 1st Qu.: 26.00
## Chargedoff :11992 Median :0.1840 Median : 67.00
## Defaulted : 5018 Mean :0.1928 Mean : 96.07
## Past Due (1-15 days) : 806 3rd Qu.:0.2500 3rd Qu.:137.00
## Past Due (31-60 days): 363 Max. :0.4975 Max. :755.00
## (Other) : 1108 NA's :7625
## EmploymentStatus CreditScoreRangeLower CreditScoreRangeUpper
## Employed :67322 Min. : 0.0 Min. : 19.0
## Full-time :26355 1st Qu.:660.0 1st Qu.:679.0
## Self-employed: 6134 Median :680.0 Median :699.0
## Not available: 5347 Mean :685.6 Mean :704.6
## Other : 3806 3rd Qu.:720.0 3rd Qu.:739.0
## : 2255 Max. :880.0 Max. :899.0
## (Other) : 2718 NA's :591 NA's :591
## FirstRecordedCreditLine AmountDelinquent DebtToIncomeRatio
## : 697 Min. : 0.0 Min. : 0.000
## 12/1/93 0:00: 185 1st Qu.: 0.0 1st Qu.: 0.140
## 11/1/94 0:00: 178 Median : 0.0 Median : 0.220
## 11/1/95 0:00: 168 Mean : 984.5 Mean : 0.276
## 4/1/90 0:00 : 161 3rd Qu.: 0.0 3rd Qu.: 0.320
## 3/1/95 0:00 : 159 Max. :463881.0 Max. :10.010
## (Other) :112389 NA's :7622 NA's :8554
## IncomeRange IncomeVerifiable StatedMonthlyIncome
## $25,000-49,999:32192 Mode :logical Min. : 0
## $50,000-74,999:31050 FALSE:8669 1st Qu.: 3200
## $100,000+ :17337 TRUE :105268 Median : 4667
## $75,000-99,999:16916 NA's :0 Mean : 5608
## Not displayed : 7741 3rd Qu.: 6825
## $1-24,999 : 7274 Max. :1750003
## (Other) : 1427
## LoanOriginalAmount LoanOriginationDate LP_NetPrincipalLoss
## Min. : 1000 1/22/14 0:00 : 491 Min. : -954.5
## 1st Qu.: 4000 11/13/13 0:00: 490 1st Qu.: 0.0
## Median : 6500 2/19/14 0:00 : 439 Median : 0.0
## Mean : 8337 10/16/13 0:00: 434 Mean : 681.4
## 3rd Qu.:12000 1/28/14 0:00 : 339 3rd Qu.: 0.0
## Max. :35000 9/24/13 0:00 : 316 Max. :25000.0
## (Other) :111428
A quick assessment shows most loans go to people with an income between $25,000 and $74,999. 16628 loans out of 112635 were in ‘default’ or ‘charged off at a loss’. 93747 loans out of 112635 were ‘Current Loans’( payments on time) or ‘Completed Loans’.
I am going to look at each variable one at a time to get a feel of the distributions and ranges.
First LoanStatus
## [1] "Cancelled" "Chargedoff"
## [3] "Completed" "Current"
## [5] "Defaulted" "FinalPaymentInProgress"
## [7] "Past Due (>120 days)" "Past Due (1-15 days)"
## [9] "Past Due (16-30 days)" "Past Due (31-60 days)"
## [11] "Past Due (61-90 days)" "Past Due (91-120 days)"
I’m going to change the factor variable LoanStatus to an ordered factor variables with bad LoanStatus to good loanStatus going from left to right. This will hopefully enable us to read the data easier and maybe find some trends.
Figure 1
Most loans are in good status, ‘Current’ and ‘Completed’.
I will convert EmploymentStatus to a more readable ordered factor.
## [1] "" "Other" "Not available" "Not employed"
## [5] "Part-time" "Self-employed" "Full-time" "Employed"
## [9] "Retired"
Figure 2
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.1340 0.1840 0.1928 0.2500 0.4975
We see most borrowers have a ‘BorrowerRate’ of between 0.13 and 0.25. The x-axis was log transformed and sqrt transformed to see if there was a bimodal distribution. There was not a bimodal distribution.
Figure 3
A question for later is does the length of time a person has been working or unemployed reflect on their ability to pay back a loan?
The x-axis was log10 and sqrt transformed. The square root function shows more of a bell shaped curve that could be used in modeling.
Figure 4
Above shows most people have their credit history starting form between 1990 and 2010. It also shows an anomaly. We see some people have FirstRecordedCreditLine in the 2060 area. This can’t be.
Figure 5
Pre 2020_01_01 loan summary to investigate anomaly:
## Min. 1st Qu. Median Mean 3rd Qu.
## "1969-01-01" "1990-07-16" "1995-11-11" "1995-01-07" "2000-03-20"
## Max.
## "2012-12-22"
Post 2020_01_01 loan summary to investigate anomaly:
## Min. 1st Qu. Median Mean 3rd Qu.
## "2047-08-24" "2063-07-27" "2065-12-01" "2065-01-08" "2067-08-01"
## Max.
## "2068-12-30"
Using the subset function for dates before and after the date “2020-01-01” shows something that may explain the anomaly. The subset before “2020-01-01” have a range of dates starting on “1969-01-01”. The subset after “2020-01-01” have a range of dates ending on “2068-12-30”. This implies that any dates entered before “1969-01-01” were incorrectly stored as a 2000’s date instead of a 1900’s date. Whatever the reason is, these anomalous dates will be removed.
Figure 6
That’s better! Now there are no credit history starting dates near 2060.
Figure 7
This LoanOriginationData shows a bimodal distribution. There seems to be no loans originating from late 2008 to early 2009. This is probably related to the market crash. The bin width was reduced to see if there was another break at about 2013. There were no other breaks.
It would be interesting to see if the loans from before 2009-01-01 had a default rate different than the loans after 2009. So I will create a factor variable “Jan_2009_loans” with 2 factors “pre2009” and “post2009”.
The count pre2009 and post2009 are:
## pre2009 post2009
## 28124 84584
I will create a variable ‘CreditHistoryLength’ by subtracting the dates ‘FirstRecoredCreditLine’ from the ‘LoanOriginationDate’.
Figure 8
We see a peak of CreditHistoryLength at abour 5000 days and some go higher than 15000 days( 41 year).
Most people do not have an ‘AmountDelinquent’ on any of their loans when they apply for this loan. So many ‘0’s spike the graph at x=0. These were left out here to give a better idea of the distribution of the no zero values.
Figure 9
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 223 862 3030 3648 23100
Most borrows have less than $900 delinquent on any of the current loans when they apply for these loans.
DebtToIncomeRatio have some extreme values.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.140 0.220 0.277 0.320 10.010 8383
The variable definition sheet states that if a borrowers ‘DebtToIncomeRatio’ of above 1000% (10.00) it be entered as 1001%(10.01). These are not outliers. These are real data point with extreme values.
Figure 10
Above the DebtToIncomeRatio extreme values were removed to visualize the data.
To visualize the extreme values here is the DebtToIncomeRation data with values greater than 2, on the x-axis.
Figure 11
We will look closer to at these extreme values later in the multivariate section.
Figure 12
That’s hard to read because the amounts are out of order. So, I’ll make it an ordered factor.
Figure 13
That’s better. The incomes increase from left to right. Most people make 25-75,000 dollors, about twice as much as the 75-100,000 range.
Figure 14
Only a small number of borrowers did not have their income verified. We will see if it made a difference in the default or charge off rates.
Figure 15
Right tail distribution with a median at 6500.
Figure 16
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 1769 3285 4684 5979 25000
Wow! The Maximun LP_NetPrincipalLoss was $25000 and 25% (above 3rd quantile) of the loans, with a loss, have a loss greater than $5793.
New variables:
LoanQuality
## bad_loan undetermined_loan good_loan
## 16613 2260 93767
A variable LoanQuality was created with 3 ordered factors. These factors are bad_loan, undetermined_loan and good_loan.
I will create a variable AvgULCreditScore, but first I must do some data wrangling to remove outliers.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 660.0 680.0 685.9 720.0 880.0
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 19.0 679.0 699.0 704.9 739.0 899.0
After further investigation I realized there are 73 data points with a low and high score of 0 and 19.
The next highest numbers are in the mid 300 ranges. Therefore these odd scores removed.
Figure 17
An ‘AvgULCreditScore’ variable was created by removing outliers and averaging ‘CreditScoreRangeLower’ and ‘CreditScoreRangeUpper’.
##
## (300,400] (400,500] (500,600] (600,700] (700,800]
## 1 525 6061 52565 48598
A variable CreditScore.bucket was created.
The new variable “profit” was created by multiplying BorrowerRate and LoanOriginalAmount.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 686 1208 1458 2024 8750
The dataset is a data frame and has 15 variables selected from the 81 variables from the ‘prosperLoanData.csv’. 5 are continuous int variables, 5 are continuous, 2 are factor variables, 2 are date format variables and one logic variable. There are 113937 rows or data points. Each representing a single loan. ### What is/are the main feature(s) of interest in your dataset? I’m interested in trying to find out what criteria were used to approve loans and how the interest rate was determined. Additionally, I will look for factors that are common in bad loans and good loans.
The BorrowerRate variable will be use to find correlations or trends with other variables. This may give us information about how the interest rate was determined.
LoanStatus will be a main feature that will help me determine good verses bad loans. The main factors that will help determine good loans are ‘Completed’ and ‘Current’ form the LoanStatus variable. And the factors that determine bad loans are ‘Chargedoff’ and ‘default’. LoanQulaity variable was derived from these LoanStatus variable factors.
LoanQuality, LP_NetPrincipalLoss and profit variables will be used to find factors related to good loans and bad loans.
LP_NetPrincipalLoss shows if a borrower has a net loss of principal.
So LoanQuality, LP_NetPrincipalLoss and profit will be used to evaluate which of the other variables are or are not used to determine if a loan is granted. Also, which loans pay back principal and interest.
I created a new variable ‘LoanQuality’ from the ‘LaonStatus’ variable. The factors ‘Completed’ and ‘Current’ form the LoanStatus variable will be the factor ‘good_laons’ in the new ‘LoanQuality’ variable. Also, the factors ‘Chargedoff’ and ‘default’ from ‘LoanStatus’ will be ‘bad_loans’ in the ‘LoanQuality’ variable.
It would be interesting to see if the loans from before 2009-01-01 had a default rate different than the loans after 2009. So I will create a factor variable ‘Jan_2009_loans’ with 2 factors “pre2009” and “post2009”. This is described more below.
A CreditScore.bucket was created.
Also, ‘NoNetLoss’ was created from LP_NetPrincipalLoss. If LP_NetPrincipalLoss is zero NoNetLoss would be TRUE. If LP_NetPrincipalLoss is greater than zero then NoNetLoss would be false.
A variable “profit” was created by multiplying LoanOriginalAmount and BorrowerRate.
New variables: ‘LoanQuality’ ‘AvgULCreditScore’ ‘CreditHistoryLength’ ‘Jan_2009_loans’ ‘CreditScore.bucket’ ‘NoNetLoss’ ‘profit’
The LoanOriginationData show a bimodal distribution. There seems to be no loans originating from late 2008 to early 2009. This was probably caused by the market crash. It would be interesting to see if the loans from before 2009-01-01 had a default rate different than the loans after 2009-01-01!!
FirstRecordedCreditLine had an anomoly were dates added before 1/1/1969 were entered as 2000’s year instead of 1900’s. These dates were removed.
1)The ‘FirstRecordedCreditLine’ variable shows most people have their credit history starting form between 1990 and 2010. It also shows an anomaly. We see some people have FirstRecordedCreditLine in the 2060 area. This can’t be.
Using the subset function for dates before and after the date “2020-01-01” shows something that may explain the anomaly. The subset before “2020-01-01” have a range of dates starting on “1969-01-01”. The subset after “2020-01-01” have a range of dates ending on “2068-12-30”. This implies that any dates entered before “1969-01-01” were incorrectly stored as a 2000’s date instead of a 1900’s date. Whatever the reason is, these anomalous dates will be removed.
2)I converted the IncomeRange, LoanStatus, EmploymentStatus variable to an ordered factor. I did this to make these easier to read with the factors going from left to right with improvment.
3)CreditGrade, was removed. 84984/113937 Unusual distribution. The summary function shows that the ‘CreditGrade’ variable has 84984 blank or NA’s. The Variable Definitions sheet that came with this data set defines ‘CreditGrade’ as “The Credit rating that was assigned at the time the listing went live. Applicable for listings pre-2009 period and will only be populated for those listings.” This explains that most of the data(84984 loans) are post 2009 and not populated with this credit grade. That is 74.6% of the 113937 loans in this data set. Therefore, this ‘CreditGrade’ variable is not a good representation of the borrowers credit for this data set and will not be used.
1)BorrowerRate
Let’s see if we can find some of the factors that may have been used to determine rate(BorrowersRate).
Figure 18
There is a significant negative corrleation (-0.331) between BorrowerRate and LoanOriginalAmount. There is a correlation but I don’t think the bank would give a lower rate just because you asked for a big loan. Overplotting was corrected with an alpha of 0.05.
Figure 19
No trends visible. The “Not employed” factor has a high rate. An employment status of “Not employed” was probably used to determine BorrowerRate.
Figure 20
##
## Pearson's product-moment correlation
##
## data: DebtToIncomeRatio and BorrowerRate
## t = 54.3109, df = 103123, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1608178 0.1726851
## sample estimates:
## cor
## 0.1667575
A small correlation 0.167 for DebtToIncomeRatio may be taken into account in determining BorrowerRate
AmountDelinquent,StatedMonthlyIncome or CreditHistoryLength did not show any significant relationship to BorrowerRate.
Figure 21
It seems that if a borrower did not verify their income the interest rate median was 0.04 percent higher. This was probably a factor in determining interest rate.
Figure 22
There is a trend from low income range $1-24,999 to high income range $100000+. The trend is lower interest rates as income increases.
The LP_NetPrincipalLoss variable values are entered after the loan started. So this will not be used for determining interest rates. If there was a default and a charge off the amount would be entered as LP_NetPrincipalLoss.
## [1] 16266
## [1] 112567
Only 16266 out 112567 have a LP_NetPrincipalLoss greater than zero. That leaves 96301 that have an LP_NetPrincipalLoss equal to zero. There were some negative outliers that will be removed.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 1776 3294 4693 5991 25000
The median of the none zero values of LP_NetPrincipalLoss is 3294 with a max of 25000!
Figure 23
There is a lot of over plotting and the data is spread out on the y axis. So I will use an alpha of 0.1 and transform the y axis to square root.
Figure 24
Let’s see the median NetPrincipalLoss by day or days of credit history. The range of CreditHistoryLength is 19 to 16479 days with an median of 5706 days (15.6 years).
I’m going to make variables with different size bins( one day, one month, 3 months, one year, 5 years) for CreditHistoryLength to pick the graph with the best bias variance balance.
Figure 25
I like the 4th graph best. It gives a good representation of the overall data and not to much noise.
There seems to be 3 parts to this graph. The first third of the x axis increase slowly then levels off for the middle third and then increases sharply for the last third.
This seems to counter intuitive. I would have thought if a borrower had a longer credit history they would better able to pay of their loans in full.
Figure 26
This show how much of the LoanOriginalAmount is lost if LP_NetPrincipalLoss>0.
What is the median LP_NetPrincipalLoss by LoanOriginalAmount?
Figure 27
Above was created with unchained functions. This time I will chain the functions and try different size bins to see which makes the data easiest to understand.
Figure 28
I like the bin size of $500. Not to much noise and gives a good overview of the data.
Now I will use stat summary to super impose this over the original data.
Figure 29
This shows that if a loans has a non zero LP_NetPrincipalLoss most the loans loses are more than half the original value! This only showed a correlation of r=0.127.
Figure 30
The CreditScore.bucket has an upward trend with LP_NetPrincipalLoss. This probably means that borrowers with good credit can borrow large amounts and have a higher median LP_NetPrincipalLoss.
3)AvgUlCreditScore
Figure 31
With an alpha of 0.01 and jitter(h=0) this trend is visible on this scatter plot too.
What is the median LP_NetPrincipalLoss for the different AvgULCreditScore?
Figure 32
This is informative. The median LP_NetPrincialLoss increases as CreditScore goes up to 850 then it drops sharply.
The below graph show the median line over the original scatter plot.
Figure 33
Let’s look at the same graph but with sum instead of median.
Figure 34
This shows the sum of net principal loss by Credit Score. It shows losses peak at a credit score of about 650 to 700. But the highest median was at 850. This implies there were more loans at a credit score of 650 to 700 than 850. Let’s see if this is true.
Figure 35
This confirms it. Loans count peaks at the high 600’s.
LP_NetPrincipalLoss measures if a loan is bad(LP_NetPrincipalLoss>0) and how bad it is( the amount of LoanOriginalAmount). This makes me wonder which credit scores makes the most profit(Borrower rate * OriginalLoanAmount, if and only if there is no default) and loss and how much.
Which loans make more money? If a loan has a LP_NetPrincipalLoss value of zero this means the principal is being, or has been, paid back with interest. The bigger the loan and the higher the interest rate the more profit. So BorrowerRate multiplied by LoanOriginalAmount will roughly give the profit per loan. Let’s see what factors affect profit.
4)Profit
Figure 36 Mean Overlay
Figure 37 Sum Overlay
Since there are many loans the sum(red line) dwarfs the mean profits. The previous median graph had the highest value about 7000, the highest value on this graph of sum of the profit is 20,000,000. So the individual profit values turns into the black line at the bottom of this graph.
This graph shows most profit is from loans between 650 and 750 credit score.
Which CreditScore makes the most NetProfit(profit - loss)? I will create a new variable: Net_profit = profit - LP_NetPrincipalLoss Net_profit BY AvgULCreditScore
Figure 38
You can see that the median NetProfit for all loans less than a credit score of 520 are negitive! Lets zoom in to see which credit score makes the most NetProfit.
Figure 39
This shows the credit scores that make the most net profit are 690 to about 740. We will look at these variable later in the multivariant section and compare loans pre 2009 to post 2009.
4)Loan Quality Now lets compare LoanQuality to some of the other variable and see if we find any trends.
Figure 40
The ‘AmountDelinquent’ variable represents the amount delinquent the borrower is on all other loans at the time the borrower is applying for this loan. The bad_loan level of LoanQuality variable has a slightly higher median amount delinquent than good_loan. Let’s zoom in.
Figure 41
## LoanQuality: bad_loan
## [1] 1217
## --------------------------------------------------------
## LoanQuality: undetermined_loan
## [1] 1013
## --------------------------------------------------------
## LoanQuality: good_loan
## [1] 998
This figure show the median of each loanQuality group. This figure and the median statistic show about a 20% difference in the median for the good and bad loans. Therefore AmountDelinquent may have an effect on ability to pay of a loan.
Figure 42
## LoanQuality: bad_loan
## [1] 3750
## --------------------------------------------------------
## LoanQuality: undetermined_loan
## [1] 4583.333
## --------------------------------------------------------
## LoanQuality: good_loan
## [1] 5000
The median difference betweem good loans and bad loans for the variable StatedMonthlyIncome is $1250.
Figure 43
This has a significant difference in medians between good and bad loans. This is probably due to bigger loans require better credit and a higher monthly income. Therefore bigger loan amounts have a lower default or chargeOff.
Hummm. Lower percent default and chargeoff. How can we calculate percent default and chargeoff? We could use percent bad_loans.
Figure 44
## LoanQuality: bad_loan
## [1] 4888.5
## --------------------------------------------------------
## LoanQuality: undetermined_loan
## [1] 6088.5
## --------------------------------------------------------
## LoanQuality: good_loan
## [1] 5838
good_loan have a 952 higher median than bad_loan. This implies people with a longer credit history are more likely to have a good_loan.
Figure 45
## LoanQuality: bad_loan
## [1] 669.5
## --------------------------------------------------------
## LoanQuality: undetermined_loan
## [1] 689.5
## --------------------------------------------------------
## LoanQuality: good_loan
## [1] 709.5
40 point creidt score difference. Not suprising borrowers with better credit scores are more likely to pay back loans.
Figure 46
This shows when the outliers are removed the medians are much closer.
5)High Correlation
Now let’s take a closer look at some variables with a higher correlation.
First AvgULCreditScore and BorrowerRate.
Figure 47
An alpha of 1/40 helps visualize the data. This is the highest correlation we have seen so far(r=-0.488). If you have good credit you are more likely to get a lower interest rate (BorrowersRate).
This could explain the graph above with better rates for larger loans. It’s not the larger loan, it the borrowers good credit score.
What is the median BorrowerRate as credit score increases?
Figure 48
This shows as CreditScore increases the median BorrowerRate decreases.
Next AvgULCreditScore and LoanOriginalAmount
Figure 49
As credit score increases the original loan amount increases. Also, only credit scores above 700 get loan between $25,000-$35,000. This implies that credit score is taken into account when granting a loan. The correlation of AvgULCreditScore and LoanOriginalAmount is 0.3519.
##
## Pearson's product-moment correlation
##
## data: loans17$AvgULCreditScore and loans17$LoanOriginalAmount
## t = 126.1691, df = 112565, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.3468605 0.3570965
## sample estimates:
## cor
## 0.3519891
Figure 50
How much can a person borrow based on credit? Above is the median amount by AvgULCreditScore.
Figure 51
This shows the median (50th quantile-blue line) of LoanOriginalAmount by AvgULCreditScore. The 10th and 90th quantile(dotted blue line) show how the range of loans increases as credit score increases.
The max line in red gives a little insight into how loans were approved or not based on credit score.
Guidelines may have been: No loans above 35000. Loans above 25000 must have a AvgULCreditScore of at least 710. Loans above 15000 must have a AvgULCreditScore of at least 530. Loans above 2000 must have a AvgULCreditScore of at least 430.
What is the distribution of LP_NetPrincipalLoss over Time(LoanOriginationDate)?
Figure 52
This shows more loans with higher losses occurred before 2009 and a break in new loans in late 2008 and early 2009.
Figure 53
The 3 graphs show the tradeoff between variance and bias. The top graph has the mean for many data point(bin size is one day). This graph shows the changes in the overall pattern of the data(low bias) but it is hard to read. It clearly shows the lack of loans in late 2008 and early 2009. It also has high variance and you can’t really see the mean.
On the other hand the bottom graph has a more accurate measure of the mean because the bin is larger(bin size is one year), lower variance and lower noise. The bad aspect is it misses little changes like the lack of loans (or NetPrinciapalLoss) late 2008 and early 2009. In that time period it has the loss at about $400. The true value is closer to zero. This is high bias and low variance.
I think the middle graph has good balance between bias and varience(bin size is one month).
Figure 54
This has the above mean LP_NetPrincipalLoss by LoanOriginationMonth superimposed over the original data.
What is profit by origination subseting LP_NetPrincipalLoss==0
Figure 55
With an alpha of 1/30 this shows more loans after 2009 have more profitable loans.
Figure 56
Again I like the middle graph(with month size bin). It has balance between bias and variance.
Figure 57
This is the median profit superimposed on the original scatter plot.
Can we combine profit and loss into one graph? Yes, with NetProfit.
Figure 58
It looks like NetProfit started to rise after 2009.
Can we find the changes the bank made in granting loans that can explain this increase in NetProfit?
Above we found theses apparent guidelines for granting loans. No loans above 35000. Loans above 25000 must have a AvgULCreditScore of at least 710. Loans above 15000 must have a AvgULCreditScore of at least 530. Loans above 2000 must have a AvgULCreditScore of at least 430.
Maybe in the multi variant section we can find guide lines for pre and post 2009.
6)Monthly Patterns
I will now look for monthly patterns. Like the Mitchel data.
Figure 59
This is for Loans with a LP_NetPrincipalLoss>0. I’m seeing more density in the middle months. Months 5-8(May-August), with a maxium loss in April.
Figure 60
This is profit for loans with a LP_NetPrincipalLoss==0. Even with an alpha of 0.05 I see less density in months 4-7.
Figure 61
More loans were made in April - August months. So the middle months have more loans, less profit and more loss.
Figure 62
Looks like median loan amounts peak in January (8500.00) and hits a low in May and June ($5000.00). Therefore more loans were made in the late spring and summer months, but for a lower amounts.
Therefor more loans are made in the middle months, these were smaller and made less profit and had more loss of principal!
Income Verifiable vs. Debt to income ration.
I noticed if the data frame is subsetted for (DebtToIncomeRatio less than the 0.99 quantile) most of the IncomeVerifiable==“FALSE” were filter out also.
Are these loans with unsualy high DebtToIncomeRatio AND IncomeVerifiable==“FALSE” being paided back?
Figure 63
I will create a variable NoNetLoss. The NoNetLoss variable is true is if LP_NetPrincipalLoss is zero and false if LP_NetPrincipalLoss is greater than zero.
## F T
## 16266 96301
The percent of NoNetLoss==F to NoNetLoss==T for all the data is, NoNetLoss==F/(NoNetLoss==F to NoNetLoss==T):
## [1] 0.1445006
The percent of NoNetLoss==F to NoNetLoss==T for data with IncomeVerifiable==“FALSE” & DebtToIncomeRatio>10 is: NoNetLoss==F/(NoNetLoss==F to NoNetLoss==T)
## [1] 0.4782609
That is a big difference. The data definition page that came the with data set states if the DebtToIncomeRatio is greater than 10 it is entered as 10.01. Therefore these are not outliers. These are real data point with extremely high values.
This combination of IncomeVerifiable==“FALSE” & DebtToIncomeRatio>10 has a percent of loans with a loss of principal of 47.8% and the overall data only has 14.4% with a loss of principal. That is a big difference.
This combination of IncomeVerifiable==“FALSE” & DebtToIncomeRatio>10 should have been a red flag for possibly fraud. These borrows have a huge DeptToIncome ratio AND they can’t verify their income AND they have a loss of principal 3 times the overall data.
I found a correlation between BorrowerRate and LoanOriginalAmount. I used qplot to see the relationship. Over plotting was present. I used alpha to reduce over plotting but an alpha of 1/10. I used the cor.test inside the with() function to get the Pearson product-moment correlation of -0.331. This demonstrates a significant correlation. There is a correlation but I don’t think the bank would give you a lower rate just because you asked for a big loan.
There is over plotting when comparing BorrowerRate and AvgULCreditScore. Only after using an alpha of 0.05 to was I able to see the true slop pattern of the graph.These variables have a stronger correlation(r=-0.488). This makes sense, if you have good credit score you are more likely to get a lower interest rate (BorrowersRate). This shows as Credit Score increases the median BorrowerRate decreases.
The boxplot of LoanQuality and BorrowerRate shows a significant difference in BorrowerRate for good loans and bad loans. The box plot shows a lower BorrowerRate for good_loans(0.176) than bad_loans(0.239).
It seems that if a borrower did not verify their income the interest rate median was 0.04 percent higher. This was probably a factor in determining intrest rate.
Figure 51 shows as credit score increases the original loan amount increases. It also shows the median (50th quantile-blue line) of LoanOriginalAmount by AvgULCreditScore. The 10th and 90th quantile(blue dotted lines) show how the range of loans increases as credit score increases.
The max line in red gives a little insight into how loans were appoved or not based on credit score. It also gives us some information about how interest rates were determined.
Guidelines: Only, credit scores above 700 get loan between $25,000-$35,000. No loans above 35000. Loans above 25000 must have a AvgULCreditScore of at least 710. Loans above 15000 must have a AvgULCreditScore of at least 530. Loans above 2000 must have a AvgULCreditScore of at least 430.
This supports the idea that credit score is taken into acount when granting a loan!
A)LP_NetPrincipalLoss The median of the none zero values of LP_NetPrincipalLoss is 3294 with a max of 25000!
LP_NetPrincipalLoss was determined and entered into the data set after the loan started so it cannot be used to determine the factors of interest rate determination.
LP_NetPrincipalLoss is compared to CreditHistoryLength, LoanOriginalAmount, AvgULCreditScore and LoanOriginationDate above and discussed below:
1)What is the distribution of LP_NetPrincipalLoss over Time(LoanOriginationDate) Figure 52 shows more loans with higher losses occurred before 2009 and a break in new loans in late 2008 and early 2009.
I used the filter, group_by, summarise and arrange functions form dplyr to get the mean and median data frame with buckets of day, month and year.
The 3 graphs show the tradeoff between variance and bias. The top graph has the mean for many data point(bin size is one day). This graph shows the changes in the overall pattern of the data(low bias) but it is hard to read. It clearly shows the lack of loans in late 2008 and early 2009. It also has high variance and you can’t really see the mean.
On the other hand the bottom graph has a more accurate measure of the mean because the bin is larger(bin size is one year), lower variance and lower noise. The bad aspect is it misses little changes like the lack of loans (or NetPrinciapalLoss) late 2008 and early 2009. In that time period it has the loss at about $400. The true value is closer to zero. This is high bias and low variance.
I think the middle graph has good balance between bias and varience(bin size is one month). I superimposed this on the original data to create Figure 54.
2)The CreditHistoryLength variable was created from the difference between the LoanOriginationDate and FirstRecordedCreditLine. Does the length someone has a credit history effect their interest rate(BorrowerRate) or their ability to pay off the loan without a net loss of principal(LP_NetPrincipalLoss)?
The LP_NetPrincipalLoss was plotted against CreditHistoryLength variable. I made variables with different size bins( one day, one month, 3 months, one year, 5 years) for CreditHistoryLength to pick the graph with the best bias variance balance. There seems to be 3 parts on this graph. The first third of the x axis increase slowly then levels off for the middle third and then increases sharply for the last third.
This seems to counter intuitive. I would have thought if a borrower had a longer credit history they would better be able to pay of their loans in full.
Figure 29 shows that if a loans has a non zero LP_NetPrincipalLoss most of the loans lose are more than half the original value!
5)What is the sum LP_NetPrincipalLoss for the different AvgULCreditScore? This shows the sum of net principal loss by Credit Score. It shows losses peak at a credit score of about 650 to 700.
This makes me wonder which credit scores makes the most profit(Borrower rate * OriginalLoanAmount, if and only if there is no default) and loss.
Which loans make more money? If a loan has a LP_NetPrincipalLoss value of zero this means the principal is being paid back with interest. The bigger the loan and the higher the interest rate the more profit. So BorrowerRate multiplied by LoanOriginalAmount will roughly give the profit per loan. Let’s see what factors affect profit.
What is profit by origination, subseting by LP_NetPrincipalLoss==0?
Figure 57 shows loans after 2009 have more profitable loans the median profit superimposed on the original scatter plot. Monthly bin size was the best balance between variance and bias.
Can we combine profit and loss into one graph??
Can we find the changes the bank made in granting loans that can explain this increase in NetProfit? Maybe in the multi variant section we can find the different guidelines for pre and post 2009.
Net_profit by AvgULCreditScore showed You can see that the median NetProfit for all loans less than a credit score of 520 are negative! After zooming in with coord_cartesian() function we see the credit scores that make the most net profit are 690 to about 740. We will look at these variables later in the multi variant section and compare loans pre 2009 to post 2009.
D)Loan Quality was explored above. The most striking finding was Loan Quality vs AvgULCreditScore. A box plot and statistical analysis shows loans defined as bad had a 40 point lower AvgULCreditScore.
That is a big difference. The data definition page that came the with data set states if the DebtToIncomeRatio is greater than 10 it is entered as 10.01. Therefore these are not outliers. These are real data point with extremely high values.
This combination of IncomeVerifiable==“FALSE” & DebtToIncomeRatio>10 has a percent of loans with a loss of principal of 47.8% and the overall data only has 14.4% with a loss of principal. That is a big difference.
This combination of IncomeVerifiable==“FALSE” & DebtToIncomeRatio>10 should have been a red flag for possibly fraud or a bad risk. These borrows have a huge DeptToIncome ratio AND they can’t verify their income AND they have a loss of principal 3 times the overall data!!
AvgULCreditScore and BorrowerRate An alpha of 1/40 helps visualize the data. This is the highest correlation I have seen (r=-0.488). If you have good credit you are more likely to get a lower interest rate (BorrowersRate). This could explain the graph above with better rates for larger loans. It’s not the larger loan, it the borrowers good credit score that get the lower interest rate.
The median BorrowerRate as credit score increases was explored. Figure 51 shows as credit score increases BorrowerRate decreases especially going from a score of 600 toward 900.
Now that we are in the multivariant section let’s look at differences pre2009 and post2009
Figure 64
We see an increase in BorrowerRate for almost all categories after January 2009. More so with the “Not employed”.
Figure 65
We see an increase in BorrowerRate after Jan_2009_loans. The increase is more prevelant in loans with IncomeVerifiable==FALSE.
Figure 66
Interest rate increases after Jan_2009 for all income ranges, more so for low income and “Not employed”.
Now let’s look at principal loss, profit, and net profit. Above we looked at these 3 with relation to AvgCreditScore. Now let’s look at these 3 with ralation to EmploymentStatusDuration.
Using dplyr I’ll make a data frame and group by EmploymentStatusDuration_5years(5 year bins), EmploymentStatus, NoNetLoss.
Now I will use dcast to cast the NoNetLoss vaiable into a “T” and “F” varible which represent the number of loans that do and don’t have a zero LP_NetPrincipalLoss.
Now we can graph the ratio of LP_NetPrincipalLoss>0 to LP_NetPrincipalLoss==0 for every level of EmploymentStatus over every 5 year bin of EmploymentStatusDuration_5years.
Figure 67
This shows that as the EmploymentStatusDuration goes from zeor to 150 months “Full Time employed” have the highest consistant ratio.
Figure 68
This gives us a lot of information! Apparent pre2009 guidelines for loan approval: no loans above $25,000 no loans above $15,000 with a credit score less than 550 no loans above about $2,000 with a credit score less than 440 no loans with a credit score less than about 380
Apparent pre2009 guidelines for loan approval: no loans above $35,000 no loans above $25,000 with a credit score less than 710 no loans above $15,000 with a credit score less than 650 no loans with a credit score less than about 600
This shows stricter requirements on credit score and higher maximum loans to $35,000. This may explain the increase in NetProfit after 2009 we saw earlier.
Possible Fraud exploration. Which original loan amount are most frequent?
Figure 69
summary(loans17$LoanOriginalAmount)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 4000 6500 8365 12000 35000
summary(loans17$NoNetLoss)
## F T
## 16266 96301
#16266/(16266+96301)
The overall percent of loans with a NoNetLoss==F(green) is 14%. That is 14% of the above graph is green.
Most loans are for less than $10,000. Most borrowers (purple) are paying(or paid) back the loans with no net loss of principal.
Figure 70
## F T
## 10122 17869
## [1] 0.3616162
## F T
## 6144 78432
## [1] 0.07264472
Fewer loans were made pre2009. Also, the pre2009 graph has 36.2% of loans with a NoNetLoss==F(the green) and the post2009 graph has only 7.3% of loans with a NoNetLoss(green in the right graph above). This implies the bank changed loan requirement policys in 2009 and these changes decreased net loss.
Let’s take this one step further. I want to look at loans with the “unusual combination” of IncomeVeriable==F and DebtToIncomeRation>=10. I want to be able to put these loans into the facet grid as one group. To this I will create a variable ‘pFraud’ that is true if a borrower has an IncomeVeriable==F and DebtToIncomeRation>=10. pFraud stands for posible fraud.
Figure 71
This is ineresting. We can see fewer NoNetLoss==F(green) in the post 2009 loans. NoNetLoss==F represent the loans that have lost some or all of the princical. So, the change in borrowing requirements that appears to have taken effect in 2009 seem to be working to reduce net principal loss.
with(subset(loans17,!is.na(NoNetLoss) &
IncomeVerifiable==T &
DebtToIncomeRatio<10 &
Jan_2009_loans=='pre2009'),
summary(NoNetLoss))
## F T
## 9592 17008
9592/(9592+17008)
## [1] 0.3606015
The pre2009 without the “unusual combination” has a percent of NoNetLoss==F(green) of 36.1%. That is 36.1% of the top left graph is green.
## F T
## 5152 72079
## [1] 0.06670896
The post2009 without the “unusual combination” has a percent of NoNetLoss==F(green) of 6.7%. That is 6.7% of the top right graph is green.
with(subset(loans17, IncomeVerifiable==F &
DebtToIncomeRatio>=10 &
Jan_2009_loans=='pre2009'),
summary(NoNetLoss))
## F T
## 88 95
88/(88+95)
## [1] 0.4808743
The pre2009 with the “unusual combination” has a percent of NoNetLoss==F(green) of 48.1%. That is 48.1% of the bottom left graph is green. This graph has the highest percent green. This implies there is a chance of a bad bank policy in granting loans in pre2009 or possible fraud.
Now look at the two graphs in the first column. These look at pre2009 loans. The loans that have this unique combination(the bottom left graph) show a higher percentage (48.1%) of green (loss of principal) than the pre2009 loans without this unique combination(top left graph),36.1%.
Look closely at the graphs of pre2009 loans with the unique combination(bottom left). There are 20 loans of $25,000. It looks like 13 of the 20 have a loss of principal. Why would a banker give the maximum loans($25,000 at the time) to a person with a DebtToIncomeRatio of greater than 10 and who can’t varifiy thier income? Fraud, maybe, or bad policy. Now looking at the post2009 loans with the unique combination(bottom right graph). This show that the new loan requirements that started in 2009 are working. There is only one loan post2009 with that unique combination and it has no net loss of principal!
This “profit”" variable is just a quick way to evaluate loan quality. It Multiplies LoanOriginalAmount by BorrowerRate.
Figure 72
I like the graph with the $500 bin. It has a balance between bias and variance. We see bigger loans on average make more money.
In addition to creating a “profit mean” variable I also created a “profit sum” variable so we can add up all the profit values for the bins.
Figure 73
With these I like the graph with the $3000 bin size. Looks like most profit is from loans under $10000.
This creates a data frame that group_by LoanOriginationMonth and IncomeVerifiable.
Figure 74
This shows that borrower without an income verification have a higher average loss of principal than borrower with an income verification.
What is the ratio of non-verified income to verified income?
Figure 75
This shows that ratio of (LP_NetPrincipalLoss with IncomeVerification==FALSE) to (LP_NetPrincipalLoss with IncomeVerification==TRUE). Clearly loans without income verification have higher LP_NetPrincipalLoss_mean. One time the ratio to loans with income verification is greater than 3 to 1 and frequently 2 to 1.
This may imply that the borrower may not be truthful when giving stated income.
#CreditScore.bucket
loans19.Loss_by_OriginationMonth_wCreditScore.bucket<-loans19%>%
filter(!is.na(LP_NetPrincipalLoss & LP_NetPrincipalLoss>0))%>%
group_by(LoanOriginationMonth, CreditScore.bucket)%>%
summarise(LP_NetPrincipalLoss_mean=mean(LP_NetPrincipalLoss),
LP_NetPrincipalLoss_median=median(as.numeric(LP_NetPrincipalLoss)),
n=n())%>%
arrange(LoanOriginationMonth, CreditScore.bucket)
#head(loans19.Loss_by_OriginationMonth_wCreditScore.bucket)
This utilizes dplyr and aggragates the data. This groups by LoanOriginationMonth AND CreditScore.bucket.
Figure 76
Now we can see the mean for all the CreditScore.bucket levels. This shows how after January 2009 no loans were made to a borrower with a credit score lower than 600. Also this shows before 2009 the average principal loss was much higher. The dotted line is the grand mean.
Figure 77
This shows the sum of the losses(LP_NetPrincipalLoss) by CreditScore.bucket and the overall sum of the losses.
Now, can we do this with profit?
Figure 78
This show that before 2009 the top two credit buckets made most of the profit. This may be why they changed police to only make loans to borrowers with credit score of 600 or better.
Now we can get grand sum the profit.
Figure 79
Wow! This shows profit approaching 10 million at about 2014. Let’s compare this to the losses and see if the changes that were made after 2009 improved NetProfit.
Now let’s see profit minus LP_NetPrincipalLoss, to get NetProfit.
Figure 80
This shows a negative net profit before 2009 and slow rise in net profit though2012 and then a spike to about $10,000,000 by about 2014!
Figure 81
This shows that CreditScore.buckets increase profits as income ranges increases within and across credit score buckets.
Figure 82
Samething for LP_NetPrincipalLoss. It increases as CreditScore.bucket increases for all income ranges.
Figure 83
Above shows most Netprofit comes from the highest two credit score buckets and higher income ranges.
Here I created a “LoanOriginationYear.bucket”.
Figure 84
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2006 2009 2012 2011 2013 2014
##
## 2006 2007 2008 2009 2010 2011 2012 2013 2014
## 5205 11317 11469 2036 5627 11163 19447 34179 12124
This shows how percent of bad loans decreases over time. Also it shows 2013 about 34000 loans!
Figure 85
Now that the scales are free we see that the percent of bad loans decrease as credit score bucket increase. Here green is a good loan.
Figure 86
This was why I created the LoanQuality variable. There are to many levels to make sense. So I converted the ‘Completed’ and ‘Current’ to the level ‘good_loan’ in the LoanQuality variable. And the ‘Cancelled’ and ‘ChargedOff’ levels were converted to ‘bad_loan’ and everything else ‘undetermined_loan’. The conversion turned the above busy graph in the the easier to understand graph below.
Figure 87
Figure 88
We can see the after January 2009 only loans with a credit score of 600 or better recieved loans.
Figure 89
With alpha of 1/50 it is ease to see most of the NetProfit points are positive and the most profitable loans are from income ranges of $75-$99,000 and $100,000. Without alpha most points look negative. Let’s zoom in to see the median line.
Figure 90
The pre2009 subset shows a median NetProfit line below a credit score of about 520 is all negative NetProfit. Also, post2009 loans have a median NetProfit line is over twice as high as the pre2009 subset.
Figure 68 gives great insight into some of the requirements for maximum loan amount and interest rates.
Figure 78 shows that before 2009 the top two credit buckets made most of the profit. This may be why they changed police to only make loans to borrowers with credit score of 600 or better.
Figure 79 Wow! This shows profit approaching 10 million at about 2014.
Figure 80 shows a negative net profit before 2009 and slow rise in net profit though2012 and then a spike to about $10,000,000 by about 2014!
Statistical evaluations and figure 70 shows fewer loans were made pre2009. Also, the pre2009 graph has 36.2% of loans with a NoNetLoss==F(the green) and the post2009 graph has only 7.3% of loans with a NoNetLoss(green in the right graph above). This implies the bank changed loan requirment policys in 2009 and these changes decreased net loss.
Figure 74 LP_NetPrincipalLoss_mean vs LoanOriginalMonth shows that borrower without an income verification have a higher average loss of principal than borrower with an income verification.
Figure 83 shows most Netprofit comes from the highest two credit score buckets and higher income ranges.
Figures 10 and 11 show that, loans granted with a debt to income ratio of greater than 10 are rare. Also, figure 14 shows that loans without income verification are also rare. So finding loans with both of these two rare factors is bad loan granting policy or possible fraud. To examine these loans closer a variable pFraud was created. The variable value was “T” for loans with both “income not verified” and “a debt to income ratio greater than 10”. The variable value was “F” for all other loans. Also, in late 2008 there was a market crash and, as figure 68 shows, loans were granted differently after January 2009. Therefore a variable was created and given a value of “pre2009” for loans granted before January 2009 and “post 2009” for loans granted after January 2009.
This graph was created to explore the possible fraudulent loans. Also to see want affect loan granting criteria, both pre and post January 2009, had on net loss.
The graph was colored purple for loans without a loss of principal or interest. The graph was colored green for loans with a loss of principal or interest. The most interesting part of this is the bottom left graph. These are the loans written before 2009 and where pFraud is true. That is, loans written before January 2009 with no income verification and a debt to income ration greater than 10. As the statistical evaluation above (under figure 71) show these loans have a percent green or “F” of 48%. That is 48% of the loans in the bottom left lost money. That is 12% higher loss rate for other loans of the same time period, “pre2009”.
That is, the top left graph.
Therefore the policy of granting loans without income verification and a debt to income ratio of greater than 10 (bottom left graph) is at least bad loan granting policy or at worst fraud. The most sticking part of the bottom left graph are the 20 loans granted for $25,000, 13 of these lost money. It would be interesting to see if the same loan agent wrote these loans. Also, to see if his or her relative got the loans.
Also, we saw in the graph 68 the loans granted post 2009 had stricter criteria. It appears these stricter criteria greatly reduced net loss. There is much less green in the two graphs on the right compared to the left.
The graph above shows the distributions of loans with the original loan amount vs. the credit score of the borrower. This was done for loans granted before January 2009(pre2009) and after January 2009(post2009). In late 2008 there was a market crash and it appears loans were granted differently before and after January 2009. This graph was created to see if we can determine what the criteria for granting a loan were before January 2009 and if the criteria changed after January 2009.
The blue line represents the median loan amount over the different credit scores of the borrowers. An alpha of 1/40 was used to better visualize the distribution. The dotted blue lines below and above the median line represent the 10th and 90th Quintile, respectively. The dotted red line represents the sum of all loans across the credit scores of the borrowers.
From these graphs we can clearly see some of the loan criteria. For example prior to January 2009 the maximum loan was $25,000 and after January 2009 the maximum loan was $35,000. Below are other apparent criteria for granting loans before and after January 2009:
This may explain the increase in NetProfit after 2009 we will see below.
Subtracting the losses from the profit for a loan created the NetProfit variable. The distribution of the loans net profit by credit score of the borrower were plotted. An alpha of 0.02 was used to better visualize the data. Again, the distribution was looked at for loans before January 2009(pre2009) and after 2009(post 2009). This was done to see if the market crash and the new loan criteria after January 2009 had an effect on net profit. The red line shows median profit for every loan in the data set by the different credit scores. Also, the loan data points were colored by income range of the borrower.
This shows a negative net profit for loans granted before January 2009 and with a credit score less that 550. It also show more loans with a high net profit, between $2500 and $5000 after 2009 are predominantly from loans with an income of $75,000-$99,999(light green) and greater than $100,000(green). The plot shows no loans after January 2009 with a credit score lower than 600. The red lines show that after January 2009 the median net profit was higher over all credit scores. This implies the new criteria for granting loans after 2009 and the improving economic contributed to increased net profits.
The prosperLoanData was analyzed. The data set contained 113937 data point with 15 variables. Through the exploration some variables were added and some were remove. First variables were looked at individually. Some variables needed wrangling. For example some loan origination date had a year of 2060. These were removed. Many variables were created to give insight into relationships between variables. Examples of these were average credit score and net profit. These new variables enabled understanding of loan rate requirements and where and when loans were the most profitable. Eventually many variables were compared at once to draw conclusions. For example the created variable pre and post 2009 loans, income range, average credit score (in credit score buckets), net profit (derived from the profit and loss variables) came together in final plot 3 to give insight into amount of profit by whom and when. To investigate this data further I would look to see, from the original 81 variables in the data set, if one loan agent granted those loans that were bad or possibly fraudulent. Also, I would do a linear regression to try to evaluate if the derived loan criteria for loan rate and maximum loan amount could create a model.
I struggled to understand why there were no loans granted from the end of 2008 to mid 2009, figure 52. Also, why I was seeing more loans prior to January 2009 but more profit after January 2009, figure 58. Plotting net profit variable vs. loan origination month (figure 80) and net profit by credit score (figure 90) gave great insight into what was going on. Figure 80 showed that two credit score accounted for most of the profit. Figure 90 used facet-wrap to show loans granted before and after January 2009. This showed negative net profit in lower credit scores prior to January 2009. It also showed a change in loan granting criteria and much more profit after January 2009. I successfully understood the reason why more loans prior to January 2009 made less net profit than fewer loans after January 2009. The reason was because pre January 2009 loans were granted with bad criteria and also a bad economy.
Resources: 1) http://www.inside-r.org/node/230680, 2) http://stackoverflow.com/questions/21653295/dplyr-issues-when-using-group-bymultiple-variables, 3) http://stackoverflow.com/questions/24880835/how-to-melt-and-cast-dataframes-using-dplyr, 4) http://www.statmethods.net/advgraphs/axes.html